
Yet another incident caused by not following Elasticsearch's official recommendations

A quick write-up of the analysis, troubleshooting, and handling of an Elasticsearch cluster incident.

Background

In the morning, a colleague pulled us into a group chat to report that an ES cluster they were using was seeing write timeouts; a look at the cluster status showed it was already RED.

  • This is a 3-node cluster of 8-core / 32 GB machines (JVM heap set to 16 GB)
  • At the time of the incident the cluster held 5,000 indices, each with 6 shards and 1 replica, i.e. 12 shards per index
    • That put the shard count at about 60,000, not counting the ES system indices (a quick way to confirm these numbers is sketched right after this list)
  • The cluster instances had been restarted
  • Total data volume was about 480 GB
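The shard count and heap figures above can be confirmed from the cluster itself; a minimal check (the filter_path and column choices below are just one convenient way to trim the output):

GET /_cluster/health?filter_path=status,number_of_nodes,active_shards,unassigned_shards

GET /_cat/nodes?v&h=name,node.role,heap.max,ram.max

Roughly 60,000 shards spread over three 16 GB heaps is about 20,000 shards per node, far beyond the often-cited 7.x sizing guidance of keeping shard counts on the order of 20 shards per GB of heap.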

Does any of this feel familiar?
=> An earlier incident caused by not following Elasticsearch's official recommendations

We immediately contacted the business team about deleting historical data; they confirmed that the 2022 and 2023 data could be dropped.
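With date-suffixed index names, that cleanup boils down to wildcard deletes along these lines (a sketch only: the notification-* patterns are assumed from the index naming that shows up later in the logs, not the exact commands run; wildcard deletes also require action.destructive_requires_name to be false, which is the 7.x default):

# illustrative patterns, assumed from the naming convention seen later in this post
DELETE /notification-*-2022.*
DELETE /notification-*-2023.*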

Handling

Speeding up recovery

After the business team ran the bulk deletes, I first raised the cluster's recovery-related settings:

PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.node_concurrent_recoveries": 200,
"indices.recovery.max_bytes_per_sec": "80mb"
}
}

The recovery-speed settings are explained here: https://elastic.ac.cn/guide/en/elasticsearch/reference/8.17/recovery.html
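These values are deliberately aggressive and only meant for the duration of the recovery; once the cluster is green again they can be restored to defaults by setting them back to null, e.g.:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}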

While the cluster was recovering, the ES instances kept emitting the following WARN log:

Unable to acquire permit to use snapshot files during recovery, 
this recovery will recover index files from the source node.
Ensure snapshot files can be used during recovery by setting
[indices.recovery.max_concurrent_snapshot_file_downloads] to be no greater than [25]

A look at the cluster settings showed:

GET /_cluster/settings?include_defaults=true

{
"recovery": {
"recovery_activity_timeout": "1800000ms",
"retry_delay_network": "5s",
"internal_action_timeout": "15m",
"max_concurrent_snapshot_file_downloads_per_node": "25",
"retry_delay_state_sync": "500ms",
"max_concurrent_snapshot_file_downloads": "5",// 这里并没有比25大
"internal_action_long_timeout": "1800000ms",
"max_concurrent_operations": "1",
"use_snapshots": "true",
"max_concurrent_file_chunks": "2"
}
}

The per-recovery value (5) is indeed not greater than 25; the warning appears because, with node_concurrent_recoveries raised to 200, the node-wide pool of 25 snapshot-download permits (max_concurrent_snapshot_file_downloads_per_node) runs dry and the affected recoveries simply fall back to copying files from the source node. Since that is a slower path rather than a failure, I ignored it for the moment; once recovery finished, the log stopped appearing.
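Had the warning needed silencing, one option (a sketch, not something done here) would have been to turn off snapshot-based recovery for the duration, so every recovery copies from a peer node instead of competing for download permits:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.use_snapshots": false
  }
}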

Troubleshooting

I picked one ES node and downloaded the last few days of logs to find out why the instance had exited.

Before the instance exited, there was a flood of exceptions like this:

this may be caused by lack of enough unfragmented virtual address space 
or too restrictive virtual memory limits enforced by the operating system,
preventing us to map a chunk of 414627 bytes. Please review 'ulimit -v', 'ulimit -m' (both should return 'unlimited'),
and 'sysctl vm.max_map_count'. More information:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

A check showed max_map_count (and the ulimits) looked fine, both on the host and inside the container:

[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# ulimit -m
unlimited
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# ulimit -v
unlimited
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]#
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# sysctl vm.max_map_count
vm.max_map_count = 262144
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# docker exec -it 21871c061981 /bin/bash
root@prd-jiankunking-notify-es-3:/usr/share/elasticsearch# ulimit -m
unlimited
root@prd-jiankunking-notify-es-3:/usr/share/elasticsearch# ulimit -v
unlimited
root@prd-jiankunking-notify-es-3:/usr/share/elasticsearch# sysctl vm.max_map_count
vm.max_map_count = 262144

262144 is the minimum value Elasticsearch 7.16.2 asks for at startup.
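If the limit ever did need raising, it would have to be done on the host (the container inherits it); a sketch with an illustrative value:

# raise the kernel limit on the host and persist it across reboots (the value is illustrative)
sysctl -w vm.max_map_count=1048576
echo 'vm.max_map_count=1048576' >> /etc/sysctl.conf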

Next, the log at the moment the instance exited:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fe4d805c000, 65536, 1) failed; error='Not enough space' (errno=12)

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
[thread 330 also had an error][thread 232 also had an error]
# An error report file with more information is saved as
# logs/hs_err_pid7.log

Then the contents of logs/hs_err_pid7.log:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# JVM is running with Zero Based Compressed Oops mode in which the Java heap is
# placed in the first 32GB address space. The Java Heap base address is the
# maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
# to set the Java Heap base and to place the Java Heap above 32GB virtual address.
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2728), pid=7, tid=213
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.1+12 (17.0.1+12) (build 17.0.1+12)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (17.0.1+12, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Core dump will be written. Default location: /usr/share/elasticsearch/core.7
#

--------------- S U M M A R Y ------------

Command Line: -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -Xms16g -Xmx16g -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-18266345477890424435 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -XX:MaxDirectMemorySize=8589934592 -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=25 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=default -Des.distribution.type=docker -Des.bundled_jdk=true org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

Host: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz, 8 cores, 31G, Ubuntu 20.04.3 LTS
Time: Wed Jan 8 00:44:25 2025 UTC elapsed time: 28895.244676 seconds (0d 8h 1m 35s)

Conclusion

At this point it is clear that the instance exited because a native memory allocation (mmap) requested by ES could not be satisfied.
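errno=12 (ENOMEM) on an mmap call can mean either that physical RAM/swap is exhausted or that the process has run into vm.max_map_count; with ~20,000 shards per node, each keeping several mmap-ed Lucene files open, either is plausible. A quick comparison on a live node (a sketch; the pgrep pattern is illustrative):

# memory-mapped regions held by the ES process vs. the kernel limit
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
wc -l /proc/${ES_PID}/maps
sysctl vm.max_map_count

# and how much physical memory/swap is left on the host
free -h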

After the data deletion, via

GET /_cat/health?v

I watched active_shards_percent climb back to 100%, yet 19 shards remained unassigned.
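To list exactly which shards those were (a minimal sketch with the cat shards API):

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state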

To see why they were not being assigned:

GET /_cluster/allocation/explain

The response:

{
"reason" : "ALLOCATION_FAILED",
"at" : "2025-01-07T16:28:22.026Z",
"failed_allocation_attempts" : 5,
"details" : """failed shard on node [IYowNoBkTzq_WPjXgb_10Q]:
failed recovery, failure RecoveryFailedException[[notification-jiankunking-2025.01][3]:
Recovery failed on {es-notify-203}{IYowNoBkTzq_WPjXgb_10Q}
{nzTGfgdBQvykqKacU_cyzA}{127.0.0.203}{127.0.0.203:9301}{cdfhilmrstw}{ml.machine_memory=33568223232, xpack.installed=true,
transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=17179869184}];
nested: IndexShardRecoveryException[failed to recover from gateway]; nested:
EngineCreationFailureException[failed to open reader on writer]; nested:
IOException[Map failed: MMapIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/n5E4PEFpTKKDgGdNmEqSPQ/3/index/_j50_Lucene80_0.dvd")
[this may be caused by lack of enough unfragmented virtual address space or too restrictive virtual memory limits enforced by the operating system,
preventing us to map a chunk of 4321640 bytes. Please review 'ulimit -v', 'ulimit -m' (both should return 'unlimited'),
and 'sysctl vm.max_map_count'. More information:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html]]; """,
"last_allocation_status" : "no"
}

A manual retry of the failed allocations:

POST /_cluster/reroute?retry_failed=true

The cluster status went red -> yellow -> green.

After the business team cleared out the historical data, the cluster ended up with:

  • shard count down to about 40,000
  • data volume down to 335.3 GB
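The longer-term fix is on the index side rather than the cluster side: at roughly 480 GB across 5,000 indices, each index holds on the order of 100 MB, so a single primary shard per index is more than enough. A sketch of an index template along those lines (the template name and index pattern are assumptions based on the index naming seen above):

PUT /_index_template/notification-template
{
  "index_patterns": ["notification-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}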
